10/10/2020

Outline

  • General introduction to data visualization

  • Introduction to ggplot2

  • Grammar of graphics

  • A case study using NBA data

  • Useful packages and extensions for ggplot2

WHEN and WHY to visualize data

  • Exploratory data analysis

    • Explore pattern, trend, and distribution of one variable

    • Explore association between variables

  • Statistical analysis

    • Diagnostic plots for linear regression

  • Report your results and communicate with non-statisticians

    • A clearer way of presenting findings

    • Engage your audience

  • For fun…

An example for fun

## # A tibble: 12 x 6
##    dataset    meanx meany   sdx   sdy  corxy
##    <chr>      <dbl> <dbl> <dbl> <dbl>  <dbl>
##  1 bullseye    54.3  47.8  16.8  26.9 -0.069
##  2 circle      54.3  47.8  16.8  26.9 -0.068
##  3 dino        54.3  47.8  16.8  26.9 -0.064
##  4 dots        54.3  47.8  16.8  26.9 -0.06 
##  5 h_lines     54.3  47.8  16.8  26.9 -0.062
##  6 high_lines  54.3  47.8  16.8  26.9 -0.069
##  7 slant_down  54.3  47.8  16.8  26.9 -0.069
##  8 slant_up    54.3  47.8  16.8  26.9 -0.069
##  9 star        54.3  47.8  16.8  26.9 -0.063
## 10 v_lines     54.3  47.8  16.8  26.9 -0.069
## 11 wide_lines  54.3  47.8  16.8  26.9 -0.067
## 12 x_shape     54.3  47.8  16.8  26.9 -0.066

A statistical plot contains much more information than a table of summary statistics!
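The same point can be checked with data that ships with R. The sketch below swaps in Anscombe's quartet (the built-in anscombe data set) for the data behind the table above, so it is an illustration of the idea and not the code that produced that table. It computes the same six summary columns for four small data sets and then plots them.

```r
# anscombe is built into R: columns x1..x4 and y1..y4 hold four small data sets
stats <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(dataset = i,
             meanx = mean(x), meany = mean(y),
             sdx = sd(x), sdy = sd(y),
             corxy = cor(x, y))
}))

# The rounded summaries are the same for all four sets
print(round(stats, 2))

# Plotting the four sets side by side shows how different they are
op <- par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = "x", ylab = "y", main = paste("Set", i))
}
par(op)
```

The summary table cannot distinguish the four sets, while the scatter plots show a linear trend, a curve, and two sets dominated by a single outlying point.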

What to plot

  • One variable: Histogram, Bar chart, Density plot…

  • Two variables: Scatter plot, Box plot, Violin plot…

  • Multiple variables: Heatmap…

  • Checking normality: Q-Q plot…

  • Think of your data and variables carefully, and choose the most appropriate statistical plot.

Introduction to the data set

##             TEAM    SEASON  WIN.   PTS OFFRTG DEFRTG   PACE REGION ABV
## 1  Atlanta Hawks 2015-2016 0.585 102.8  104.6  100.8  97.63   East ATL
## 2  Atlanta Hawks 2018-2019 0.354 113.3  107.5  113.1 104.56   East ATL
## 3  Atlanta Hawks 2017-2018 0.293 103.4  104.4  110.1  98.76   East ATL
## 4  Atlanta Hawks 2016-2017 0.524 103.2  104.5  105.2  97.76   East ATL
## 5 Boston Celtics 2015-2016 0.585 105.7  105.8  102.5  99.43   East BOS
## 6 Boston Celtics 2016-2017 0.646 108.0  110.6  108.0  97.21   East BOS

  • WIN.: Winning rate, the proportion of games played that a team has won (a value between 0 and 1).

  • PTS: The average number of points scored per game.

  • OFFRTG: Offensive Rating, which measures a team’s points scored per 100 possessions.

  • DEFRTG: Defensive Rating, which is the number of points allowed per 100 possessions by a team.

  • PACE: Pace, which is the number of possessions per 48 minutes for a team.

  • REGION: East/West.

  • ABV: The abbreviation of a team.
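To try the code on the later slides on a small scale, the six rows printed above can be typed in as a toy stand-in for nba.data. This is only a sketch: the full data set is not reproduced here, and storing SEASON and REGION as factors (with a "West" level that does not occur in these six rows) is an assumption.

```r
# Toy version of nba.data built from the six rows shown on this slide
nba.data <- data.frame(
  TEAM   = c("Atlanta Hawks", "Atlanta Hawks", "Atlanta Hawks",
             "Atlanta Hawks", "Boston Celtics", "Boston Celtics"),
  SEASON = factor(c("2015-2016", "2018-2019", "2017-2018",
                    "2016-2017", "2015-2016", "2016-2017")),
  WIN.   = c(0.585, 0.354, 0.293, 0.524, 0.585, 0.646),
  PTS    = c(102.8, 113.3, 103.4, 103.2, 105.7, 108.0),
  OFFRTG = c(104.6, 107.5, 104.4, 104.5, 105.8, 110.6),
  DEFRTG = c(100.8, 113.1, 110.1, 105.2, 102.5, 108.0),
  PACE   = c(97.63, 104.56, 98.76, 97.76, 99.43, 97.21),
  # Assumption: REGION is a factor with both levels, even though only East appears here
  REGION = factor(rep("East", 6), levels = c("East", "West")),
  ABV    = c("ATL", "ATL", "ATL", "ATL", "BOS", "BOS")
)

# Check column types, then compute the mean winning rate per team
str(nba.data)
win.by.team <- aggregate(WIN. ~ TEAM, data = nba.data, FUN = mean)
print(win.by.team)
```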

A “good” plot can deliver a lot of information

But “bad” plots may…

  • be hard to read if labels and legends are not clear

  • confuse people if it is not well-designed

  • deliver misleading information (sometimes on purpose)

Visualization tools in R - A histogram example

The histograms of winning rate in different regular NBA seasons and regions generated by ggplot2 and graphics packages:

Comparing code for the same plot

Code in ggplot2:

ggplot(data = sub.dt, aes(x = WIN.)) + 
  geom_histogram(binwidth = 0.1, color = "black") + facet_grid(REGION ~ SEASON)

Code in the graphics package:

par(mfrow = c(2, 2), mar = c(2, 2, 3, 1))
# One panel per REGION x SEASON combination (both must be factors for levels())
for (i in levels(sub.dt$REGION)) {
  for (j in levels(sub.dt$SEASON)) {
    subdata <- subset(sub.dt, REGION == i & SEASON == j)
    # Plot the subset, not the full data set
    hist(subdata$WIN., breaks = seq(0, 1, 0.1),
         main = paste(i, j, sep = ", "))
  }
}

Grammar of Graphics

  • Idea: a graph is a combination of independent building blocks.

  • Data that you want to visualise and a set of aesthetic mappings describing how variables in the data are mapped to aesthetic attributes.

  • Layers made up of geometric elements and statistical transformations. Geometric objects (geoms for short) are what you see on the plot: points, lines, polygons, etc. Statistical transformations (stats for short) summarise data in many useful ways.

  • The scales map values in the data space to values in an aesthetic space, whether it be colour, or size, or shape.

  • A coordinate system, coord for short, describes how data coordinates are mapped to the plane of the graphic.

  • A facet describes how to break up the data into subsets and how to display those subsets as small multiples.

  • A theme which controls the finer points of display, like the font size and background colour.
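The building blocks listed above can be seen side by side in a single call. This is a minimal sketch, assuming ggplot2 is installed; it uses the mpg data set that ships with ggplot2 instead of the NBA data so that it runs on its own, and the particular scale, limits, and theme are arbitrary choices for illustration.

```r
library(ggplot2)

p <- ggplot(data = mpg,                                  # data
            aes(x = displ, y = hwy, color = drv)) +      # aesthetic mappings
  geom_point() +                                         # layer: geom with the identity stat
  geom_smooth(formula = y ~ x, method = "lm",
              se = FALSE) +                              # layer: geom with a smoothing stat
  scale_color_brewer(palette = "Dark2") +                # scale for the color aesthetic
  coord_cartesian(ylim = c(10, 45)) +                    # coordinate system
  facet_wrap(~year) +                                    # facet: one panel per model year
  theme_minimal()                                        # theme

print(p)
```

Because the blocks are independent, any one line can be swapped (for example facet_wrap for facet_grid) without touching the others.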

The start of plotting a graph

  • A ggplot2 plot always starts with a call to ggplot().

  • We can specify the data set and the aesthetic mappings in ggplot(). Without a geom, only an empty canvas with axes is drawn.

p <- ggplot(data = nba.data, aes(x = OFFRTG, y = WIN.))
p

Aesthetics

  • Map the variables in the data to the components in the plot

  • x: x axis

  • y: y axis

  • color: color of the boundary of a symbol

  • fill: color of the inside of a symbol

  • shape: shape of points, solid point, circle, triangle…

  • size: size of points

  • linetype: type of lines, solid line, dashed line…

  • …
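An aesthetic can either be mapped to a variable inside aes() or set to a fixed value outside it. The sketch below shows the difference, again using the mpg data set from ggplot2 as a stand-in for the NBA data; the fixed values chosen ("blue", shape 1, size 2) are arbitrary.

```r
library(ggplot2)

# Mapped: color depends on the variable drv, so a legend is drawn
p_mapped <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = drv))

# Set: every point gets the same fixed appearance, so no legend is drawn
p_set <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue", shape = 1, size = 2)

print(p_mapped)
print(p_set)
```

A common mistake is writing aes(color = "blue"), which maps color to a constant string and produces a legend entry labelled "blue" instead of blue points.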

Geometries

  • Geometries are the actual graphical elements displayed in a plot. They draw the variables mapped in aes().

  • We use + to add geometry functions to a plot

p + geom_point()

Geometries

  • We can also specify data and aes in a geom function. They don’t have to be the same as those in ggplot().

ggplot() + geom_point(data = nba.data, aes(x = DEFRTG, y = WIN.))

geom function

  • One continuous variable

p <- ggplot(data = nba.data, aes(x = WIN.))
p + geom_histogram(binwidth = 0.1)
p + geom_density()

geom function

  • Continuous X, continuous Y

p <- ggplot(data = nba.data, aes(x = OFFRTG, y = WIN.))
p + geom_point()
p + geom_line()
p + geom_density_2d()
p + geom_smooth(formula = y ~ x, method = "lm")

geom function

  • Discrete X, continuous Y

p <- ggplot(data = nba.data, aes(x = SEASON, y = WIN.))
p + geom_boxplot()
p + geom_violin()

Multiple geom layers

ggplot(data = nba.data, aes(x = WIN.)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.1, color = "black") +
  geom_density()

Multiple geom layers

ggplot(data = nba.data, aes(x = OFFRTG, y = WIN.)) +
  geom_point() + 
  geom_smooth(formula = y ~ x, method = "lm")

Multiple geom layers

ggplot(data = nba.data, aes(x = SEASON, y = WIN.)) +
  geom_violin() +
  geom_boxplot(width = 0.2)

The order of geom functions is important: later layers are drawn on top of earlier ones

ggplot(data = nba.data, aes(x = SEASON, y = WIN.)) +
  geom_boxplot(width = 0.2) +
  geom_violin()

Facet

  • Facet functions make it easy to create panel plots.

  • facet_wrap wraps a 1d sequence of panels into 2d.

p <- ggplot(data = nba.data, aes(x = OFFRTG, y = WIN.)) +
  geom_point() + geom_smooth(formula = y ~ x, method = "lm", se = FALSE)
p + facet_wrap(~SEASON)

Facet

  • facet_grid forms a matrix of panels defined by row and column faceting variables.

p <- ggplot(data = nba.data, aes(x = OFFRTG, y = WIN.)) +
  geom_point() + geom_smooth(formula = y ~ x, method = "lm", se = FALSE)
p + facet_grid(REGION ~ SEASON)

Scale

  • The scale functions control how the plot maps data values to the visual values of an aesthetic, for instance,

    • scale_x_continuous

    • scale_y_discrete

    • scale_color_gradient

    • scale_fill_manual

  • Scale function names follow the pattern scale_<aesthetic>_<type>. The first element names the aesthetic (x, y, color, fill, …), and the second names the type of scale (continuous, discrete, gradient, manual, …).

  • You can also specify the axis or legend label in the scale function.

  • R color cheatsheet: https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/colorPaletteCheatsheet.pdf

Scale

p <- ggplot(data = nba.data) + 
  geom_point(aes(x = OFFRTG, y = DEFRTG, color = WIN., shape = REGION))
p + scale_x_continuous(name = "offensive rate", limits = c(97, 116)) +
  scale_y_reverse(name = "defensive rate") +
  scale_color_gradient(name = "winning rate", low = "green", high = "red") +
  scale_shape_discrete(name = "region", labels = c("EAST", "WEST"))

Design your own plot

  • coord_* functions control the transformation of the coordinate system, such as coord_trans(y = "sqrt").

  • We can change the overall theme of a plot using the theme_* functions.

  • The labs function sets the title, subtitle, caption, and axis labels of your plot.

  • The theme function is a powerful way to customize the non-data components of your plots, i.e. titles, labels, fonts, background, gridlines, and legends. See R help for details.

  • ggsave saves the plot to a file on your local drive.
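The functions on this slide can be chained in one short example. This is a sketch under stated assumptions: it uses the mpg data set from ggplot2 in place of the NBA data, the labels and theme settings are made up for illustration, and the plot is written to a temporary PDF file (the file name example_plot.pdf is hypothetical).

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  coord_trans(y = "sqrt") +                      # square-root transformed y axis
  theme_bw() +                                   # complete built-in theme
  labs(title = "Highway mileage by engine size", # illustrative labels
       subtitle = "mpg data set from ggplot2",
       caption = "Example only",
       x = "Engine displacement (litres)",
       y = "Highway miles per gallon") +
  theme(legend.position = "bottom",              # fine-tune non-data components
        panel.grid.minor = element_blank(),
        plot.title = element_text(face = "bold"))

# Save to a temporary file; width and height are in inches by default
out_file <- file.path(tempdir(), "example_plot.pdf")
ggsave(out_file, plot = p, width = 6, height = 4)
```

Calling theme() after theme_bw() matters: a complete theme such as theme_bw() resets any theme() settings that were added before it.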

ggplot2 online documentation

A case study

library(ggplot2)
library(ggrepel)  # provides geom_text_repel()

ggplot(data = nba.data, aes(x = OFFRTG, y = DEFRTG, size = WIN.)) +
  geom_point(aes(color = REGION), shape = 1) + 
  geom_text(data = subset(nba.data, WIN. > 0.65),
            aes(label = ABV), size = 1.5) +
  geom_text_repel(data = subset(nba.data, WIN. < 0.3),
            aes(label = ABV), size = 1.5, 
            min.segment.length = 0, box.padding = 0.3) +
  facet_wrap(~SEASON) +
  theme_bw() +
  scale_x_continuous("Offensive Rate") +
  scale_y_reverse("Defensive Rate", limits = c(118, 95)) +
  scale_color_manual("Region" ,values = c("blue3", "red3")) +
  scale_size_continuous("Winning Rate", breaks = c(0.2, 0.4, 0.6)) +
  theme(legend.position = "bottom",
    panel.grid.minor = element_blank())

Useful packages or extensions for ggplot2

  • gridExtra: A package that helps you arrange multiple plots on a page

  • GGally: An extension that reduces the complexity of combining geometric objects with transformed data

  • ggExtra: A package which can add marginal density plots or histograms to ggplot2 scatterplots.

  • ggrepel: A package providing geom_text_repel() and geom_label_repel(), which work like geom_text() but move labels apart so they do not overlap

  • gganimate: A grammar of animated graphics

  • more information: http://www.ggplot2-exts.org/gallery/
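gridExtra is the only package in this list without an example on the later slides, so here is a minimal sketch of arranging two plots on one page. It assumes both ggplot2 and gridExtra are installed and uses the mpg data set from ggplot2 as a stand-in for the NBA data.

```r
library(ggplot2)
library(gridExtra)

# Two independent plots of the same data
p1 <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()
p2 <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2, color = "black")

# Draw them side by side; grid.arrange() also returns the combined layout invisibly
combined <- grid.arrange(p1, p2, ncol = 2)
```

Unlike facets, which split one plot by a variable, grid.arrange() combines plots that may have entirely different geoms, data, and axes.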

GGally

  • ggpairs: Make a matrix of plots with a given data set.

  • ggcorr: plot a correlation matrix (heatmap) with ggplot2

ggpairs(data = nba.data, columns = 3:7)
ggcorr(data = nba.data[, 3:7])

ggExtra

  • ggMarginal: Create a ggplot2 scatterplot with marginal density plots (default) or histograms, or add the marginal plots to an existing scatterplot.

p <- ggplot(nba.data, aes(x = OFFRTG, y = DEFRTG, color = REGION)) +
  geom_point() + theme_bw() + theme(legend.position = "bottom")
ggMarginal(p, groupColour = TRUE, groupFill = TRUE)

gganimate

ggplot(data = nba.data, aes(x = OFFRTG, y = DEFRTG, size = WIN.)) +
  geom_point(aes(color = REGION), shape = 1) + 
  geom_text_repel(aes(label = ABV), size = 1.5, box.padding = 0.3) +
  theme_bw() +
  scale_y_reverse(limits = c(120, 97)) +
  scale_color_manual(values = c("blue3", "red3")) +
  # Here come the gganimate-specific bits
  labs(title = 'SEASON: {closest_state}', x = 'OFFRTG', y = 'DEFRTG') +
  theme(title = element_text(size = 5), 
        text = element_text(size = 2)) +
  transition_states(SEASON,
                    transition_length = 2,
                    state_length = 1)

Thanks for listening!